Coding (and Research) Tips

Jacob Toner Gosselin

2026-01-16

Outline

  1. Instrumental Variables (IV)
    • The DAG and Endogeneity Problem
    • The Solution
    • Estimation (2SLS)
    • Example: Macro Question
  2. Difference-in-Differences (DiD)
    • The DAG and Endogeneity Problem
    • The Solution
    • Estimation (TWFE)
    • Examples: Static and Dynamic DiD

Instrumental Variables

IV: The DAG and Endogeneity Problem

  • Consider an RCT: random assignment R determines X, so even though there are back doors between X and Y, we can still identify X -> Y
  • Idea of IV: can we find a variable Z that takes the place of R?

IV: The Solution

  • Consider standard linear model: \[ Y = \beta X + \varepsilon \]
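  • Assume (1) relevance, \(Cov(X, Z) \neq 0\), so that \(E[X|Z]\) actually moves with Z, and (2) exogeneity, \(E[\varepsilon|Z] = 0\). Conditioning on Z then gives \[ E[Y|Z] = \beta E[X|Z] + E[\varepsilon|Z] = \beta E[X|Z] \]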
  • Mechanically, this corresponds to:
    1. Explain X with Z, and keep only what is explained, X'
    2. Explain Y with Z, and keep only what is explained, Y'
    3. Get the relationship (regression slope) between X' and Y'; that slope is the IV estimate of \(\beta\)
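
The three mechanical steps above can be checked on simulated data. This is a minimal sketch, not part of the original slides: the variable names, the sample size, and the true effect of 2 are all illustrative assumptions. It also compares the result to the closed-form ratio Cov(Y, Z) / Cov(X, Z) and to plain OLS, which is biased by the back door.

```r
# Simulated data: every name and parameter value here is an illustrative assumption
set.seed(1)
n <- 5000
z <- rnorm(n)                    # instrument
u <- rnorm(n)                    # unobserved confounder (the back door)
x <- 0.8 * z + u + rnorm(n)      # treatment depends on Z and on the confounder
y <- 2 * x + 3 * u + rnorm(n)    # true beta = 2

# Step 1: explain X with Z and keep only the explained part
x_prime <- fitted(lm(x ~ z))
# Step 2: explain Y with Z and keep only the explained part
y_prime <- fitted(lm(y ~ z))
# Step 3: slope of Y' on X'
beta_iv <- unname(coef(lm(y_prime ~ x_prime))["x_prime"])

beta_wald <- cov(y, z) / cov(x, z)           # same IV estimate in closed form
beta_ols  <- unname(coef(lm(y ~ x))["x"])    # ignores the back door, so biased

print(c(ols = beta_ols, iv = beta_iv, wald = beta_wald))
```

The IV and closed-form numbers agree up to floating-point error, and both land near the true value, while OLS overshoots because the confounder raises both X and Y.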

IV: The Solution (visualized)

IV: Estimation (2SLS)

Most commonly, this is estimated using two-stage least squares (2SLS):

  1. Use the instruments and controls to explain \(X\) in the first stage
  2. Use the controls and the predicted (explained) part of \(X\) in place of \(X\) in the second stage
  3. Adjust the standard errors, since the second stage uses a predicted regressor (the software does this for you)

There are many ways to do this in R; I’ll be doing 2SLS with feols() from fixest
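
To make the two stages concrete, here is a hedged sketch that runs them by hand with lm() and compares the point estimate to feols(). The data are simulated, and the names (w, z, u, sim) and coefficient values are assumptions made for illustration only.

```r
library(fixest)

# Simulated data: names and parameter values are illustrative assumptions
set.seed(2)
n <- 2000
w <- rnorm(n)                              # exogenous control
z <- rnorm(n)                              # instrument
u <- rnorm(n)                              # unobserved confounder
x <- 0.7 * z + 0.5 * w + u + rnorm(n)      # endogenous regressor
y <- 1.5 * x + 1.0 * w + 2 * u + rnorm(n)  # true effect of x is 1.5
sim <- data.frame(y, x, z, w)

# First stage: the instrument and the control explain X
first_stage <- lm(x ~ z + w, data = sim)
sim$x_hat <- fitted(first_stage)

# Second stage: the predicted X replaces X
second_stage <- lm(y ~ x_hat + w, data = sim)

# feols() runs both stages in one call: outcome ~ controls | endogenous ~ instrument
m_2sls <- feols(y ~ w | x ~ z, data = sim, vcov = "hetero")

manual_beta <- unname(coef(second_stage)["x_hat"])
feols_beta  <- unname(coef(m_2sls)["fit_x"])   # fixest prefixes the fitted regressor with "fit_"
print(c(manual = manual_beta, feols = feols_beta))
```

The point estimates match, but the standard errors from the hand-rolled second stage are not valid, because lm() treats x_hat as ordinary data. That is the adjustment a dedicated 2SLS routine handles.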

Example 1: Macro Question!

  • How does US income affect US expenditures (“marginal propensity to consume”)?
  • We can instrument income with last year's investment (measured here as income minus expenditure).
library(AER)    # provides the USConsump1993 data
library(dplyr)  # mutate(), lag(), %>%
library(fixest) # feols()
# US income and consumption data, 1950-1993
data(USConsump1993)
USC93 <- as.data.frame(USConsump1993)
# lag() gets the observation above; here the observation above is last year
IV <- USC93 %>% mutate(lastyr.invest = lag(income) - lag(expenditure))
# 2SLS estimation: outcome ~ controls | endogenous ~ instrument
m_iv <- feols(expenditure ~ 1 | income ~ lastyr.invest, data = IV, vcov = 'hetero')

Example 1: Macro Question!

                       Income (First Stage)         Expenditure
Income (instrumented)                               0.892***
                                                    (0.009)
Lagged Investment      8.210***
                       (0.620)
Num.Obs.               43                           43
Std.Errors             Heteroskedasticity-robust    Heteroskedasticity-robust
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Example 1: Stata Replication

* Load the data
import delimited "usconsump1993.csv", clear

* 2SLS estimation: instrument income with lagged investment
* ivregress 2sls depvar (endogenous = instruments), vce(robust)
ivregress 2sls expenditure (income = lastyr_invest), vce(robust)

* View first stage
estat firststage

Difference-in-Differences

DiD: The DAG and Endogeneity Problem

  • We compare the time before the policy to the time after
  • But if anything else is changing over time, we have a problem
  • Need a control group that is not treated

DiD: The Solution

  • Before-After Difference for Untreated: \[ E[Y | U, A] - E[Y | U, B] = Time \]
  • Before-After Difference for Treated: \[ E[Y | T, A] - E[Y | T, B] = Time + Trmt \]
  • Difference-in-Differences: \[ (E[Y | T, A] - E[Y | T, B]) - (E[Y | U, A] - E[Y | U, B]) = Trmt \]
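
A worked example with made-up numbers (purely illustrative, not from any dataset): suppose the untreated group's average outcome goes from 10 before to 12 after, and the treated group's goes from 11 before to 16 after. Plugging these into the three expressions above:

\[
\begin{aligned}
E[Y | U, A] - E[Y | U, B] &= 12 - 10 = 2 = Time \\
E[Y | T, A] - E[Y | T, B] &= 16 - 11 = 5 = Time + Trmt \\
(E[Y | T, A] - E[Y | T, B]) - (E[Y | U, A] - E[Y | U, B]) &= 5 - 2 = 3 = Trmt
\end{aligned}
\]

The raw before-after change for the treated group (5) overstates the effect because 2 of it would have happened anyway.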

DiD: The Solution (visualized)

DiD: Estimation (TWFE)

  • Standard DiD estimation is two-way fixed effects (TWFE) regression, where \(Treated\) equals 1 only for the treated group in post-treatment periods \[ Y = \gamma_i + \gamma_t + \beta Treated + \varepsilon \]
  • Why this works is easy to see if we limit it to a “2x2” DiD (two groups, two periods, with intercept \(\alpha\)) \[ Y = \alpha + \gamma_i TreatedGroup + \gamma_t After + \beta TreatedGroup\times After + \varepsilon \]
  • \(\gamma_i\) is the prior-period group difference, \(\gamma_t\) is the shared time effect, and \(\beta\) is how much bigger the \(TreatedGroup\) effect gets after treatment vs. before, i.e. how much the gap grows (Difference-in-Differences!)
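
A quick check of the 2x2 logic, as a sketch on simulated data (the variable names and the true effect of 0.2 are assumptions for illustration): the interaction coefficient from the regression is exactly the difference-in-differences of the four group-by-period means.

```r
# Simulated 2x2 data: names and parameter values are illustrative assumptions
set.seed(3)
n <- 4000
treated_group <- rbinom(n, 1, 0.5)   # 1 = treated group
after <- rbinom(n, 1, 0.5)           # 1 = post-treatment period
y <- 1 + 0.5 * treated_group + 0.3 * after +
  0.2 * treated_group * after + rnorm(n)
sim <- data.frame(y, treated_group, after)

# Mean outcome in one group-by-period cell
cell_mean <- function(g, a) {
  mean(sim$y[sim$treated_group == g & sim$after == a])
}

# Difference-in-differences from the four cell means
did_means <- (cell_mean(1, 1) - cell_mean(1, 0)) -
  (cell_mean(0, 1) - cell_mean(0, 0))

# Same quantity from the 2x2 regression: the interaction coefficient
m_2x2 <- lm(y ~ treated_group * after, data = sim)
did_reg <- unname(coef(m_2x2)["treated_group:after"])

print(c(from_means = did_means, from_regression = did_reg))
```

The two agree to numerical precision because the regression is saturated: four coefficients for four cells.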

Example 1

  • As a quick example we’ll use data(injury) from library(wooldridge)
  • This is from Meyer, Viscusi, and Durbin (1995): in 1980, Kentucky changed its workers' compensation law to increase benefits, but only for high-earning individuals
  • What effect did this have on how long injured workers stay out of work?
  • The treated group is individuals who were already high-earning, and the control group is those who weren’t

Example 1

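library(dplyr)        # filter(), mutate(), %>%
library(fixest)       # feols()
library(modelsummary) # msummary()
data(injury, package = 'wooldridge')
injury <- injury %>%
  filter(ky == 1)  %>% # Kentucky only
  mutate(Treated = afchnge*highearn) # 1 for high earners after the change
# Group (highearn) and time (afchnge) fixed effects; IID standard errors, as reported in the table
m1_did <- feols(ldurat ~ Treated | highearn + afchnge, data = injury, vcov = 'iid')
msummary(m1_did, stars = TRUE, gof_omit = 'FE|RMSE|R2|AIC|BIC|Lik|Adj|Pseudo')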
              (1)
Treated       0.191**
              (0.069)
Num.Obs.      5626
Std.Errors    IID
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Example 1: Stata Replication

* Load the data
import delimited "injury_ky.csv", clear

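* TWFE DiD regression with fixed effects for group (highearn) and time (afchnge)
* Note: vce(robust) gives heteroskedasticity-robust SEs; the R table reports IID SEs, so drop it to match those
reghdfe ldurat Treated, absorb(highearn afchnge) vce(robust)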

* Alternative without reghdfe:
* reg ldurat Treated i.highearn i.afchnge, vce(robust)

Example 2: Dynamic DiD

  • We often estimate a dynamic effect, where the effect is allowed to differ by how long it has been since (or until) treatment
  • Simply interact \(TreatedGroup\) with binary indicators for each time period \(k\), leaving out the last period before treatment, \(k_0\), as the reference \[ Y = \gamma_i + \gamma_t + \sum_{k \neq k_0} \beta_k (TreatedGroup \times 1[t = k]) + \varepsilon \]
  • Typically we plot the \(\beta_k\) coefficients to see how the effect evolves over time; the pre-treatment \(\beta_k\) should be close to zero if the groups were on similar trends

Example 2: Dynamic DiD

library(dplyr)
library(fixest)
library(ggfixest) # provides ggcoefplot()
library(ggplot2)
library(readr)
df <- read_csv('data/eitc.csv') %>%
  mutate(treated = 1*(children > 0)) %>%
  mutate(year = factor(year))
# assert that '1993' is a level of year
stopifnot('1993' %in% levels(df$year))
m <- feols(work ~ i(year, treated, ref = '1993') | treated + year, data = df)
coef_plot <- ggcoefplot(m, ref = c('1993' = 3), pt.join = TRUE) +
  labs(title = "Dynamic Difference-in-Differences Estimates of EITC on Work",
       x = "Year",
       y = "Coefficient Estimate (ref: 1993)") +
  theme_minimal() +
  theme(plot.title = element_text(size = 24),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 18))

Example 2: Dynamic DiD

General Tips

During Research

During Writing

  • Export tables and figures directly from code. No screenshots!
  • Include examples using texreg (R) or estout (Stata)
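
As one possible way to follow the "no screenshots" advice, the sketch below writes a regression table and a figure straight to files. It uses modelsummary (already used earlier in these slides) rather than texreg or estout, and the model, the built-in mtcars data, the file names, and the temporary output folder are all placeholder assumptions.

```r
library(modelsummary)
library(ggplot2)

# Hypothetical output location; in a real project this would be a tables/figures folder
out_dir <- tempdir()

# Placeholder model on a built-in dataset
m_export <- lm(mpg ~ wt, data = mtcars)

# Write the regression table directly to a LaTeX file
table_path <- file.path(out_dir, "table_mpg.tex")
modelsummary(m_export, output = table_path, stars = TRUE)

# Write the figure directly to a PDF file
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()
figure_path <- file.path(out_dir, "figure_mpg.pdf")
ggsave(figure_path, plot = p, width = 6, height = 4)
```

The paper can then pull in the .tex and .pdf files, so rerunning the code updates every table and figure without manual copying.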